TCGA BRCA


About Paper

  • Paper link
  • Title: Enhancing breast cancer outcomes with machine learning-driven glutamine metabolic reprogramming signature

Summary

  • Data Analyzed: Data from over 7,000 breast cancer patients across 14 datasets, including single-cell data from 8 patients (43,766 cells total).
  • Methods:
    • Integrated approach using 10 machine learning algorithms in 54 combinations.
    • Analyzed 100 breast cancer signatures.
    • Empirical validation through immunohistochemistry assays.
  • Findings:
    • Identified five consistent glutamine metabolic reprogramming (GMR) genes.
    • Developed a novel GMR model.
  • Model Performance:
    • Demonstrated higher accuracy in predicting recurrence and mortality risks compared to existing methods.
  • Validation: Immunohistochemistry validation in 30 patients supported the findings.
  • Clinical Implications:
    • Model classifies patients into high-risk and low-risk groups.
    • High-risk patients showed poorer outcomes.
    • Differential therapeutic response: low-risk patients may benefit more from immunotherapy; high-risk patients showed sensitivity to chemotherapies like BI-2536 and ispinesib.

Methods and materials

Data acquisition

  • Primary Data Source: Compiled from TCGA, including gene profiles, mutational data, and clinical information for breast cancer, with a focus on samples with survival data.
  • Validation Data: Additional datasets from GEO (studies GSE93601, GSE76250, GSE70947, etc.) were used for cross-validation, improving result reliability.

Single-cell sequencing technique

  • Data Source and Preparation: Utilized single-cell data from GEO (GSE161529). Genes with zero expression across cells were discarded.
  • Normalization: Expression levels were normalized using the “SCTransform” function in the Seurat R package.
  • Dimensionality Reduction and Clustering: PCA and UMAP were employed for dimensionality reduction. Cell clusters were identified using “FindNeighbors” and “FindClusters” functions in Seurat.
  • Quality Control: Doublets were removed using the DoubletFinder R package. Cells were excluded if they had 15% or more mitochondrial gene content or fewer than 500 genes.
  • Final Cell Count: After quality control, 43,766 cells were retained for in-depth analysis.
  • Cell Typing and Tumor Identification: CellTypist was used to categorize cell types. Tumor cells were identified using the CopyKAT algorithm.
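
The quality-control thresholds above can be sketched in base R on a toy count matrix. This is only an illustration, not the authors' Seurat pipeline: the simulated matrix `counts`, the `MT-` prefix used to flag mitochondrial genes, and the helper name `qc_filter_cells` are all assumptions made for the example.

```r
# Toy genes x cells count matrix; real data would come from GSE161529.
set.seed(1)
n_genes <- 1000
n_cells <- 50
gene_names <- c(paste0("MT-G", 1:20), paste0("GENE", 1:(n_genes - 20)))
counts <- matrix(rpois(n_genes * n_cells, lambda = 1),
                 nrow = n_genes,
                 dimnames = list(gene_names, paste0("cell", 1:n_cells)))

# Plant two low-quality cells so the filter has something to remove.
counts[1:20, "cell1"] <- 500L      # very high mitochondrial fraction
counts[, "cell2"] <- 0L
counts[21:100, "cell2"] <- 1L      # only 80 detected genes

# Hypothetical helper: keep cells with < 15% mitochondrial counts
# and at least 500 detected genes (thresholds taken from the text).
qc_filter_cells <- function(counts, max_mito_pct = 15, min_genes = 500) {
  is_mito <- grepl("^MT-", rownames(counts))
  mito_pct <- 100 * colSums(counts[is_mito, , drop = FALSE]) / colSums(counts)
  n_detected <- colSums(counts > 0)
  keep <- mito_pct < max_mito_pct & n_detected >= min_genes
  counts[, keep, drop = FALSE]
}

filtered <- qc_filter_cells(counts)
# Drop genes with zero expression across all retained cells.
filtered <- filtered[rowSums(filtered) > 0, , drop = FALSE]
dim(filtered)
```

In a Seurat workflow the same thresholds would normally be applied to per-cell metadata before normalization; doublet removal is not covered by this sketch.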

Functional analysis

  • Databases Used: Gene expression differences related to glutamine metabolic reprogramming (GMR) between tumor and normal tissues were analyzed using the GO (Gene Ontology) and KEGG (Kyoto Encyclopedia of Genes and Genomes) databases.
  • Analysis Tools: The clusterProfiler and enrichplot R packages were used to run and visualize Gene Set Enrichment Analysis (GSEA) across the risk subgroups.
  • Statistical Significance: A False Discovery Rate (FDR) below 0.05 was considered statistically significant for the analysis.
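
As a small illustration of the FDR threshold, the sketch below adjusts a handful of made-up p-values and keeps those with FDR below 0.05. The Benjamini-Hochberg method is an assumption here (the summary does not name the adjustment), and the pathway names and p-values are placeholders, not results from the paper.

```r
# Illustrative p-values for six hypothetical gene sets (not real results).
p_values <- c(pathway_a = 0.001, pathway_b = 0.008, pathway_c = 0.020,
              pathway_d = 0.040, pathway_e = 0.300, pathway_f = 0.700)

# Benjamini-Hochberg adjustment (assumed method) controls the FDR.
fdr <- p.adjust(p_values, method = "BH")

# Keep gene sets that pass the FDR < 0.05 threshold.
significant <- names(fdr)[fdr < 0.05]
print(round(fdr, 3))
print(significant)
```

Note that `pathway_d` has a raw p-value below 0.05 but an adjusted value above it, which is the kind of case the FDR criterion is meant to filter out.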

Calculating the GMR-score

  • Data Source and Analysis: Utilized the TCGA-BRCA dataset to perform differential gene expression analysis between tumor and normal breast tissues, identifying 67 differentially expressed genes associated with glutamine metabolism.
  • GMR-score Development: The GMR-score integrates the expression of these genes, which were selected from the GeneCards database using a relevance score above 8.
  • Visualization Tools: Heatmaps and networks were generated to visualize the expression and interconnections among these GMR genes.
  • Score Calculation: The GMR-score was computed using the ssGSEA and UCell algorithms applied to both bulk and single-cell data, providing an estimation of glutamine metabolism-related gene activity in breast cancer tissues.
  • Functional Role: The GMR-score estimates glutamine metabolic activity and offers insights into metabolic adaptations within the tumor microenvironment but does not directly measure metabolic flux.
  • Clinical Relevance: Spearman analysis was used to examine the association between the GMR-score and immune cell infiltrations, enhancing the evaluation of the GMR-score’s relevance in breast cancer.
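
To make the idea of a per-sample signature score and its Spearman association concrete, here is a simplified sketch. It is not ssGSEA or UCell: the score is just the mean within-sample rank of the signature genes, and the expression matrix, the gene set `gmr_genes`, and the `infiltration` vector are simulated placeholders.

```r
set.seed(42)
n_genes <- 200
n_samples <- 60
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
               dimnames = list(paste0("g", 1:n_genes),
                               paste0("s", 1:n_samples)))
gmr_genes <- paste0("g", 1:10)  # placeholder gene set, not the paper's genes

# Simplified score: mean within-sample rank of the signature genes,
# divided by the number of genes so that values fall in (0, 1].
rank_score <- function(expr, gene_set) {
  ranks <- apply(expr, 2, rank)          # rank genes within each sample
  rownames(ranks) <- rownames(expr)
  colMeans(ranks[gene_set, , drop = FALSE]) / nrow(expr)
}
gmr_score <- rank_score(expr, gmr_genes)

# Simulated infiltration estimate that partly tracks the score.
infiltration <- gmr_score + rnorm(n_samples, sd = 0.05)

# Spearman correlation between the score and the infiltration estimate.
res <- cor.test(gmr_score, infiltration, method = "spearman")
res$estimate
res$p.value
```

A rank-based score of this kind is insensitive to per-sample scaling, which is one reason rank-based methods are popular for comparing bulk and single-cell data.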

Construction of the GMR model and nomogram

  • Model Development Approach: Adopted the comprehensive workflow of Liu et al., integrating ten computational algorithms: Random Survival Forest (RSF), LASSO, Gradient Boosting Machine (GBM), Survival-SVM, SuperPC, Ridge Regression, plsRcox, CoxBoost, Stepwise Cox, and Elastic Net (Enet). RSF, LASSO, CoxBoost, and Stepwise Cox were used for dimensionality reduction and variable selection.

  • Data and Application: Utilized the TCGA-BRCA dataset as the training cohort, applying the above algorithms to generate a prognostic signature. Performance across TCGA and five external datasets was evaluated using the concordance index (C-index), which measures discriminative ability, to select the most reliable model for breast cancer (BC).

  • Model Validation: The model’s robustness and accuracy were confirmed through calibration curves, decision curve analysis (DCA), and multivariate Cox regression analyses, ensuring the prognostic relevance of the GMR genes identified.

  • Risk Score Calculation: Risk scores were calculated using the formula:

    \[ \text{risk score} = \sum_{i=1}^n (\beta_i \times \text{Exp}_i) \]

    Here, \(n\) represents the number of GMR genes, \(\text{Exp}_i\) is the expression level of each gene, and \(\beta_i\) are the coefficients from the multivariate Cox regression model. Patients were categorized into high-risk and low-risk groups based on these scores.

  • External Validation and Survival Analysis: Several external datasets were used to validate the model’s generalizability and reliability. Survival differences between high-risk and low-risk groups were assessed using Kaplan-Meier survival analysis, with statistical significance set at a p-value less than 0.05.
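
The risk-score formula and the C-index can be sketched in base R as follows. Everything here is illustrative: the gene names and coefficients in `beta` are placeholders rather than the paper's fitted values, the survival data are simulated, the median cutoff is an assumption (the summary does not state how the groups were split), and `c_index` is a naive pairwise implementation of Harrell's C rather than a library routine.

```r
set.seed(7)
n <- 100
genes <- paste0("gene", 1:5)                      # placeholder gene names
expr <- matrix(rnorm(n * 5), nrow = n, dimnames = list(NULL, genes))
beta <- c(gene1 = 0.8, gene2 = -0.5, gene3 = 0.3,
          gene4 = 0.6, gene5 = -0.2)              # placeholder coefficients

# risk score = sum_i beta_i * Exp_i
risk_score <- as.vector(expr %*% beta[genes])

# Assumed cutoff: the cohort median.
risk_group <- ifelse(risk_score > median(risk_score), "high", "low")

# Simulated survival: higher risk gives shorter expected time.
time <- rexp(n, rate = exp(risk_score) / 50)
status <- rbinom(n, 1, 0.7)                       # 1 = event, 0 = censored

# Harrell's C-index: among comparable pairs (the patient with the
# shorter time had an event), the fraction in which that patient
# also has the higher score. Tied scores count as 0.5.
c_index <- function(time, status, score) {
  concordant <- 0
  comparable <- 0
  for (i in seq_along(time)) {
    if (status[i] != 1) next
    later <- which(time > time[i])                # patients who outlived i
    comparable <- comparable + length(later)
    concordant <- concordant + sum(score[i] > score[later]) +
      0.5 * sum(score[i] == score[later])
  }
  concordant / comparable
}

table(risk_group)
c_index(time, status, risk_score)
```

A C-index of 0.5 corresponds to random ordering and 1 to perfect ordering, which is why it is a convenient single number for comparing the algorithm combinations across cohorts.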

Genomic alteration analysis

  • Data Source and Initial Analysis: Utilized the TCGA-BRCA database to analyze genetic mutation frequencies and copy number alterations (CNA) between high-risk and low-risk breast cancer (BC) subgroups.

  • Tumor Mutation Burden (TMB): Calculated the TMB for each subgroup from raw mutation files.

  • Mutation Landscape Mapping: Employed maftools to map the mutation landscape, focusing on the top 28 genes with mutation rates exceeding 5%.

  • Mutational Signatures: Used the deconstructSigs package to identify patient-specific mutational signatures, highlighting four significant signatures (SBS1, SBS3, SBS11, SBS12) known for elevated mutation frequencies in the BC dataset.

  • Chromosomal Aberrations: Investigated chromosomal aberrations, pinpointing five regions most frequently affected by amplification and deletion events. Special attention was given to four predominant genes located in chromosomal regions 8q24.21 and 9p23.
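
A tumor mutation burden calculation can be sketched as mutations per megabase. The example below makes several assumptions that are not in the summary: a 38 Mb exome size, counting every variant not labelled `Silent` (a simplification of what counts toward TMB), and hypothetical patient IDs and risk labels. The two column names follow the standard MAF format.

```r
# Toy mutation table using two standard MAF column names.
maf <- data.frame(
  Tumor_Sample_Barcode = c("P1", "P1", "P1", "P2", "P2", "P3"),
  Variant_Classification = c("Missense_Mutation", "Silent",
                             "Nonsense_Mutation", "Missense_Mutation",
                             "Silent", "Missense_Mutation"),
  stringsAsFactors = FALSE
)
risk_group <- c(P1 = "high", P2 = "low", P3 = "low")  # hypothetical labels

exome_mb <- 38  # assumed exome capture size in megabases

# Count non-silent mutations per patient, then normalise by exome size.
non_silent <- maf[maf$Variant_Classification != "Silent", ]
mut_counts <- table(factor(non_silent$Tumor_Sample_Barcode,
                           levels = names(risk_group)))
tmb <- as.vector(mut_counts) / exome_mb
names(tmb) <- names(risk_group)

# Mean TMB per risk group.
group_mean <- tapply(tmb, risk_group[names(tmb)], mean)
group_mean
```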

Identifying TME disparities

  • Immune Cell Infiltration Analysis: Employed the IOBR package, which integrates multiple algorithms—MCPcounter, EPIC, xCell, CIBERSORT, quanTIseq, and TIMER—to conduct a robust and comprehensive assessment of immune cell infiltration in BC patients categorized by the GMR-model.

  • Evaluation of TME Indices: Analyzed the ESTIMATE and TIDE indices to understand the state and structure of the immune microenvironment within the TME. This evaluation is crucial for informing immunotherapy strategies and understanding potential outcomes for BC patients.

  • Quantification of Immune Checkpoints: Further quantified immune checkpoints to provide insight into the immune state, serving as a preliminary tool for predicting patient responsiveness to immune checkpoint inhibitors (ICIs), which are crucial for personalized cancer treatment strategies.

Determining therapeutic targets and drugs

  • Data Source for Compounds: Collected a comprehensive list of 6,125 compounds from the Drug Repurposing Hub, aiming to predict chemotherapeutic responses and identify potential therapeutic targets.

  • Target Selection via Correlation Analysis: Used Spearman correlation analysis to select targets associated with BC outcomes, retaining genes with a correlation coefficient above 0.15 and a P-value below 0.05. For genes linked to adverse prognosis, the criterion was a correlation coefficient below -0.15 with the same P-value threshold, using CERES scores and risk score data from the Cancer Cell Line Encyclopedia (CCLE).

  • Drug Response Prediction Models: Employed the CTRP and PRISM datasets for extensive drug screening across cancer cell lines. Applied a ridge regression model using the pRRophetic package, predicting drug responses based on expression data. The model’s accuracy was confirmed through 10-fold cross-validation.

  • Connectivity Map (CMap) Analysis: Conducted a CMap analysis to identify promising therapeutic drugs by comparing gene expression profiles between the risk subgroups and analyzing the top 300 genes (150 up-regulated and 150 down-regulated). Because CMap scores correlate negatively with potential therapeutic efficacy in BC, compounds with lower scores were treated as more promising candidates.
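
The ridge-regression step with 10-fold cross-validation can be illustrated with a closed-form fit in base R. This sketch does not use the pRRophetic API: the helpers `ridge_fit` and `ridge_predict` are hypothetical, the penalty `lambda = 1` is arbitrary, and both the expression matrix and the drug-response vector are simulated.

```r
set.seed(123)
n <- 200
p <- 20
X <- matrix(rnorm(n * p), nrow = n)          # simulated expression (cell lines x genes)
true_beta <- c(rep(1, 5), rep(0, p - 5))
y <- as.vector(X %*% true_beta + rnorm(n))   # simulated drug sensitivity

# Closed-form ridge on centred data: beta = (X'X + lambda I)^-1 X'y.
ridge_fit <- function(X, y, lambda) {
  x_means <- colMeans(X)
  y_mean <- mean(y)
  Xc <- sweep(X, 2, x_means)
  beta <- solve(crossprod(Xc) + lambda * diag(ncol(X)),
                crossprod(Xc, y - y_mean))
  list(beta = as.vector(beta), x_means = x_means, y_mean = y_mean)
}
ridge_predict <- function(fit, X_new) {
  as.vector(sweep(X_new, 2, fit$x_means) %*% fit$beta) + fit$y_mean
}

# 10-fold cross-validation: each sample is predicted by a model
# that never saw it during fitting.
folds <- sample(rep(1:10, length.out = n))
pred <- numeric(n)
for (k in 1:10) {
  test_idx <- folds == k
  fit <- ridge_fit(X[!test_idx, , drop = FALSE], y[!test_idx], lambda = 1)
  pred[test_idx] <- ridge_predict(fit, X[test_idx, , drop = FALSE])
}
cv_cor <- cor(pred, y)
cv_cor
```

In practice the penalty would be tuned rather than fixed, and the model trained on cell-line data (CTRP or PRISM) would then be applied to patient expression profiles.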

Research Workflow for Breast Cancer Study

  1. Data Acquisition
    • Primary Data Source
      • Data compiled from TCGA, focusing on gene profiles, mutational data, and clinical information for breast cancer.
      • Emphasis on samples with survival data.
    • Validation Data
      • Additional datasets from GEO (e.g., GSE93601, GSE76250, GSE70947) for cross-validation.
  2. Single-cell Sequencing Technique
    • Data from GEO (GSE161529).
    • Procedures: Gene filtering, normalization (SCTransform), dimensionality reduction (PCA, UMAP), clustering (FindNeighbors, FindClusters), and quality control (DoubletFinder, mitochondrial content filtering).
  3. Functional Analysis
    • Using GO and KEGG databases to analyze GMR-related gene expression.
    • Tools: Enrichplot package and clusterProfiler for GSEA.
    • Emphasis on statistical significance (FDR < 0.05).
  4. GMR-score Calculation
    • Differential gene expression analysis using TCGA-BRCA.
    • Development and calculation of GMR-score using ssGSEA and UCell algorithms.
    • Visualization of gene interconnections through heatmaps and networks.
  5. Construction of GMR Model and Nomogram
    • Integration of computational algorithms (e.g., RSF, LASSO, CoxBoost).
    • Application of models to TCGA-BRCA dataset and evaluation using C-index.
    • Validation using external datasets and survival analysis (Kaplan-Meier).
  6. Genomic Alteration Analysis
    • Mutation frequency and CNA analysis using TCGA-BRCA.
    • Mapping mutation landscape and identifying mutational signatures with maftools and deconstructSigs.
  7. Identifying TME Disparities
    • Assessment of immune cell infiltration using the IOBR package.
    • Evaluation of TME indices (ESTIMATE and TIDE).
    • Quantification of immune checkpoints.
  8. Determining Therapeutic Targets and Drugs
    • Compilation of compounds from the Drug Repurposing Hub.
    • Target selection via correlation analysis and predictive modeling using CTRP, PRISM datasets, and pRRophetic.
    • Connectivity Map (CMap) analysis for identifying potential drugs.

Overall flow of this study

Fig1

Unraveling GMR complexities at single-cell level.

Fig3

Figure 3 Unraveling GMR complexities at single-cell level. (A, B) Distribution of cells collected from tumor and normal tissues of eight patients. (C, D) Distribution of cell clusters and annotated cell types. (E) UMAP plots showing the expression levels of representative marker genes representing nine cell subtypes. (F) Top 3 differentially expressed genes in each cell type. (G) A stacked bar chart showing the fractions of each cell type in normal and tumor tissues. (H) UMAP plots showing the distribution of GMR-scores in each cell. (I) Violin plot demonstrating the difference of GMR-score in each cell type. (J) CopyKat algorithm evaluates the genomic variations. (K) Comparison of GMR-score among normal, tumor diploid and aneuploid epithelial cells.

Developing the GMR prognosis model through machine learning.

Fig5
Figure 5 Developing the GMR prognosis model through machine learning. (A) The C-index of 54 machine learning algorithm combinations in six cohorts. (B) The error rate across different numbers of trees. (C) The importance of each GMR gene. (D) Distribution of risk score, survival status, and gene expression. (E) KM survival curves illustrate the survival probability in the two risk groups. (F) The kernel-smoothing hazard function plot demonstrates the correlation between relapse hazard and months in the two populations. (G) The ROC curves visualize the AUC values of the GMR-model at one, three, and five years.

Assessment and validation of the GMR-model.

Fig6
Figure 6 Assessment and validation of the GMR-model. (A) Univariate and multivariate Cox regression analysis of prognostic ability for the GMR-model and other clinicopathological features. (B) The GMR-nomogram, built from risk score, age, and stage, predicts 1-, 3-, and 5-year OS in BC. (C) Calibration curve comparing the observed OS (%) with the OS (%) predicted by the nomogram. (D) DCA curves, with the two extreme lines representing the treat-all and treat-none strategies. (E) Accuracy of the GMR-nomogram against the ideal curve, evaluated using the Hosmer-Lemeshow method. (F) Eleven ROC curves showing the AUC values of the risk score and ten clinicopathological indexes. ***P < 0.001.

My Application

Download TCGA-BRCA data

library(TCGAbiolinks)

GDCquery

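# Query STAR-aligned raw counts; the older "HTSeq - Counts" workflow
# is no longer served by current GDC data releases.
query <- GDCquery(project = "TCGA-BRCA",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "STAR - Counts")

# Download the files matched by the query into the local GDCdata folder
GDCdownload(query)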